Project Description

Pharmaceutical drugs play an important role in maintaining our health, so it is crucial that they are safe and cause few or no harmful side effects.

In this project, we create independent reports on drug safety, without relying on the drug manufacturers. We will perform several hypothesis tests in Python to find out whether adverse reactions to a hypothetical drug differ significantly from those seen with a placebo. We will also check whether the treatment and control groups differ significantly in age.

The dataset we use contains information about five specific adverse effects, as well as demographic data, vital signs, and other measurements. Our main focus is on the drug’s adverse reactions, and we want to understand if these reactions occur in significant numbers.

The dataset, named drug_safety.csv, was obtained from Hbiostat, provided by the Vanderbilt University Department of Biostatistics. It includes five adverse effects: headache, abdominal pain, dyspepsia, upper respiratory infection, and chronic obstructive airway disease (COAD). The dataset also contains demographic details, vital signs, lab measurements, and more. The number of drug-treated observations is roughly twice the number of placebo observations.

For this project, the dataset has been modified to show two key variables: adverse_effects, which indicates whether an individual had any adverse effects, and num_effects, which counts how many adverse effects each individual experienced.

The modified dataset includes the following columns:

Column           Description
sex              The gender of the individual
age              The age of the individual
week             The week during which the drug testing took place
trx              The treatment group (Drug) or control group (Placebo)
wbc              White blood cell count
rbc              Red blood cell count
adverse_effects  Whether the individual experienced at least one adverse effect
num_effects      The total number of adverse effects experienced by the individual

The original dataset is available here. In this project, we will explore the data and answer important questions related to drug safety and adverse effects.

In [11]:
# import the required libraries 
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import pingouin
import seaborn as sns
import matplotlib.pyplot as plt
In [13]:
# load the dataset
drug_safety = pd.read_csv('drug_safety.csv', usecols = lambda col: col != 'index')
drug_safety.head()
Out[13]:
age sex trx week wbc rbc adverse_effects num_effects
0 62 male Drug 0 7.3 5.1 No 0
1 62 male Drug 1 NaN NaN No 0
2 62 male Drug 12 5.6 5.0 No 0
3 62 male Drug 16 NaN NaN No 0
4 62 male Drug 2 6.6 5.1 No 0

Determine if the proportion of adverse effects differs significantly between the Drug and Placebo groups

In [35]:
# Count the adverse_effects column values for each trx group
adv_eff_by_trx = drug_safety.groupby("trx").adverse_effects.value_counts()
adv_eff_by_trx
Out[35]:
trx      adverse_effects
Drug     No                 9703
         Yes                1024
Placebo  No                 4864
         Yes                 512
Name: count, dtype: int64
In [36]:
# Count the number of total rows in each group
adv_eff_by_trx_totals = adv_eff_by_trx.groupby("trx").sum()
adv_eff_by_trx_totals
Out[36]:
trx
Drug       10727
Placebo     5376
Name: count, dtype: int64
In [37]:
# Create a list of the "Yes" counts for each group
yeses = [adv_eff_by_trx["Drug"]["Yes"], adv_eff_by_trx["Placebo"]["Yes"]]
yeses
Out[37]:
[1024, 512]
In [38]:
# Create a list of the total number of rows in each group
n = [adv_eff_by_trx_totals["Drug"], adv_eff_by_trx_totals["Placebo"]]
n
Out[38]:
[10727, 5376]
In [39]:
# Perform a two-sided z-test on the two proportions
two_sample_results = proportions_ztest(yeses, n)
two_sample_results
Out[39]:
(0.0452182684494942, 0.9639333330262475)
In [48]:
# Extract the p-value
two_sample_p_value = two_sample_results[1]
two_sample_p_value
Out[48]:
0.9639333330262475
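
As a cross-check, the same test can be worked out by hand from the counts shown above. The sketch below uses only the standard library and assumes the pooled-variance form of the two-proportion z-test; the function name `two_proportion_ztest` and the variable names are illustrative and not part of the notebook.

```python
from math import erfc, sqrt

# Counts copied from the notebook output above
yes_counts = [1024, 512]      # "Yes" rows in Drug, Placebo
group_sizes = [10727, 5376]   # total rows in Drug, Placebo


def two_proportion_ztest(successes, totals):
    """Two-sided z-test for equal proportions using the pooled variance."""
    p1 = successes[0] / totals[0]
    p2 = successes[1] / totals[1]
    # Pooled proportion under the null hypothesis of equal rates
    pooled = sum(successes) / sum(totals)
    se = sqrt(pooled * (1 - pooled) * (1 / totals[0] + 1 / totals[1]))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


z_stat, p_val = two_proportion_ztest(yes_counts, group_sizes)
print(f"Drug rate:    {yes_counts[0] / group_sizes[0]:.4f}")
print(f"Placebo rate: {yes_counts[1] / group_sizes[1]:.4f}")
print(f"z = {z_stat:.4f}, p-value = {p_val:.4f}")
```

The two adverse effect rates are both close to 9.5%, which is why the z statistic is so near zero and the p-value so near one.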

Find out if the number of adverse effects is independent of the treatment and control groups

In [42]:
# Determine if num_effects and trx are independent
num_effects_groups = pingouin.chi2_independence(
    data=drug_safety, x="num_effects", y="trx")

# Extract the p-value of the Pearson chi-squared test (first row of the stats table)
num_effects_p_value = num_effects_groups[2]["pval"][0]

num_effects_p_value
C:\Users\newbe\anaconda3\Lib\site-packages\pingouin\contingency.py:151: UserWarning: Low count on observed frequencies.
  warnings.warn(f"Low count on {name} frequencies.")
C:\Users\newbe\anaconda3\Lib\site-packages\pingouin\contingency.py:151: UserWarning: Low count on expected frequencies.
  warnings.warn(f"Low count on {name} frequencies.")
Out[42]:
0.6150123339426765
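
The "Low count" warnings above matter because the chi-squared approximation becomes less reliable when some cells of the contingency table are sparse. The following sketch shows how expected frequencies are derived and how sparse cells could be flagged, assuming the common rule-of-thumb threshold of 5. The table is a hypothetical toy example, not the real cross-tabulation of num_effects and trx, and `expected_counts` is an illustrative helper.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts for illustration only (NOT taken from drug_safety.csv):
# rows are num_effects = 0, 1, 2, 3; columns are Drug, Placebo
toy_observed = [
    [900, 450],
    [90, 45],
    [9, 4],
    [1, 1],
]

toy_expected = expected_counts(toy_observed)

# Cells whose expected count falls below the rule-of-thumb threshold of 5
low_cells = [
    (i, j)
    for i, row in enumerate(toy_expected)
    for j, value in enumerate(row)
    if value < 5
]
print(low_cells)
```

When many cells are flagged like this, one option is to merge the rare categories (for example, grouping individuals with two or more adverse effects) before testing.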

Check whether age differs significantly between the Drug and Placebo groups

In [43]:
# Create a histogram of age, colored by treatment group
sns.histplot(data=drug_safety, x="age", hue="trx")
C:\Users\newbe\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
Out[43]:
<Axes: xlabel='age', ylabel='Count'>
[Figure: histogram of age by treatment group (x-axis: age, y-axis: Count, hue: trx)]
In [44]:
# Check the histogram's impression with a Shapiro-Wilk normality test,
# to choose between an unpaired t-test and a Mann-Whitney U (Wilcoxon rank-sum) test
normality = pingouin.normality(
    data=drug_safety,
    dv='age',
    group='trx',
    method='shapiro', # the default
    alpha=0.05) # 0.05 is also the default

normality
C:\Users\newbe\anaconda3\Lib\site-packages\scipy\stats\_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
C:\Users\newbe\anaconda3\Lib\site-packages\scipy\stats\_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
Out[44]:
W pval normal
trx
Drug 0.976785 2.189152e-38 False
Placebo 0.975595 2.224950e-29 False
In [45]:
# Select the age of the Drug group
age_trx = drug_safety.loc[drug_safety["trx"] == "Drug", "age"]

# Select the age of the Placebo group
age_placebo = drug_safety.loc[drug_safety["trx"] == "Placebo", "age"]


# Since age is not normally distributed in either group, conduct a two-sided Mann-Whitney U test
age_group_effects = pingouin.mwu(age_trx, age_placebo)

age_group_effects
Out[45]:
U-val alternative p-val RBC CLES
MWU 29149339.5 two-sided 0.256963 0.01093 0.505465
In [46]:
# Extract the p-value
age_group_effects_p_value = age_group_effects["p-val"]
age_group_effects_p_value
Out[46]:
MWU    0.256963
Name: p-val, dtype: float64
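
With more than 16,000 rows, it helps to look at effect size as well as the p-value. The CLES and RBC columns in the Mann-Whitney output can be recovered from the U statistic and the group sizes, as sketched below. This assumes that no ages are missing, so the group sizes equal the row counts found earlier; the sign convention of the rank-biserial correlation may differ between pingouin versions, and the variable names are illustrative.

```python
# Values copied from the notebook output above
u_val = 29149339.5
n_drug, n_placebo = 10727, 5376   # assumes no missing ages in either group

# Common-language effect size: the share of (Drug, Placebo) pairs in which
# the Drug individual is older, with ties counted as one half
cles = u_val / (n_drug * n_placebo)

# Rank-biserial correlation expressed through the CLES
rank_biserial = 2 * cles - 1

print(f"CLES = {cles:.6f}, rank-biserial = {rank_biserial:.5f}")
```

A CLES this close to 0.5 means that a randomly chosen individual from the Drug group is older than a randomly chosen individual from the Placebo group about half the time, so any age difference between the groups is negligible in practice.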

Final Summary

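The proportion of individuals with adverse effects does not differ significantly between the Drug and Placebo groups (p-value = 0.9639).

The number of adverse effects is not significantly associated with the treatment group (p-value = 0.6150), although the low-count warnings mean the chi-squared result should be read with some caution.

The Drug and Placebo groups do not differ significantly in age (p-value = 0.2570).

These results provide no evidence that the Drug causes more adverse effects than the Placebo. A non-significant test does not by itself prove that the two are equally safe, but the adverse effect rates are very similar in both groups (about 9.5% each).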
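
To put the first result in context, a confidence interval for the difference in adverse effect rates shows how large a difference the data remain compatible with. The sketch below computes a 95% Wald interval from the counts reported earlier; the choice of the Wald (unpooled) interval is an assumption made here for simplicity, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts copied from the notebook output
drug_yes, drug_n = 1024, 10727
placebo_yes, placebo_n = 512, 5376

p_drug = drug_yes / drug_n
p_placebo = placebo_yes / placebo_n
diff = p_drug - p_placebo

# Unpooled (Wald) standard error of the difference in proportions
se_diff = sqrt(
    p_drug * (1 - p_drug) / drug_n
    + p_placebo * (1 - p_placebo) / placebo_n
)

# Two-sided 95% critical value of the standard normal distribution
z_crit = NormalDist().inv_cdf(0.975)
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff

print(f"Difference (Drug - Placebo): {diff:.4%}")
print(f"95% CI: ({ci_low:.4%}, {ci_high:.4%})")
```

With these counts the interval spans roughly -0.9 to +1.0 percentage points, so the data rule out large differences in either direction but not small ones.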